Abstract

A sequence of simulations was carried out to aid in the
diagnosis and interpretation of equating differences found
between random and matched (nonrandom) samples for four commonly
used equating procedures: Tucker linear observed-score
equating, Levine equally reliable linear observed-score
equating, Equipercentile curvilinear observed-score equating,
and IRT curvilinear true-score equating. The results support
the prediction based on theoretical grounds that observed-score
equating methods are more affected by sample variation than are
true-score equating methods. These results further suggest that
matching equating samples on the basis of fallible measures of
ability may not be advisable for any conventional equating
method except the Tucker method. In addition, the results
support a particular hypothesis about IRT equating, suggesting
that the use of matched samples cannot be recommended for this
equating method either.
Factors Affecting the Sample Invariant Properties of
Linear and Curvilinear Observed and True-Score
Equating Procedures
INTRODUCTION

For several decades, psychometricians have discussed and
debated whether or not linear observed-score equating procedures
such as the Tucker equating model (see Angoff, 1971) can provide
invariant results when new and old form samples used in the
equating differ in ability level. Levine (1955) developed a
linear true-score equating model that was claimed to be more
robust to differences in ability level of old and new form
samples than the Tucker method. In the 1980's, IRT true-score
equating (see Lord, 1980) won many advocates because of its
claim to provide sample invariant equating results, provided the
IRT model used fit the data and item parameters were adequately
estimated. In the past few years, a number of studies have been
performed to investigate the sample invariant properties of
linear and IRT equating procedures (for example, Angoff &
Cowell, 1986; Kingston, Leary, & Wightman, 1985; Cook, Eignor, &
Taft, 1988); these studies have been reviewed and contrasted in
a recent paper by Cook and Petersen (1987).
Lawrence and Dorans (1988) recently provided information
addressing the sample invariant properties of Tucker and Levine
linear equating, Equipercentile equating through an anchor test
(Design V in Angoff, 1971), and three-parameter logistic (3-PL)
model IRT curvilinear equating in the context of equating the
Scholastic Aptitude Test (SAT). Because the study described in
this paper may be viewed in certain ways as an extension of the
Lawrence and Dorans study, some of the details of the standard
SAT data collection design and the matching process employed by
Lawrence and Dorans are reviewed before the results of their
study are discussed.

Figure 1 depicts the basic SAT equating data collection
design, which essentially represents an equating design linking
the new form, labelled NEW, to two old forms, OLD1 and OLD2. The
specific old forms to be used in the equating are established in
the SAT braiding plan (Angoff, 1974); in general, the
populations taking forms NEW and OLD1 will be populations of
similar ability (data for form OLD1 will have been collected at
the same administration during a previous year as form NEW),
while the group of examinees taking form OLD2 will represent
either a more or less able candidate population (data for form
OLD2 will have been collected at a different administration
during a previous year than form NEW). Form NEW is linked to
OLD1 via one anchor test (EQ1) and to OLD2 via another anchor
test (EQ2). Typically, the average of the anchor equatings to
the two old forms is taken as the operational conversion for the
new form.

In the Lawrence and Dorans (1988) study, the authors
focused on the equating of NEW to form OLD2. In addition to
performing the usual linear, Equipercentile through an anchor
test, and 3-PL IRT equatings based on new and old form random
samples that differ considerably in ability, they also performed
matched sample equatings. In the matched sample equating
of NEW to OLD2, the sample taking OLD2 (sample 4 in Figure 1) is
chosen in a nonrandom fashion so that the old form distribution
of scores on the anchor test (EQ2) matches the observed-score
distribution of the new form equating sample (sample 2). Thus,
while the observed-score distribution for the new form sample is
the naturally occurring distribution, the observed-score
distribution for the old form sample is altered under matched
sample conditions to be similar to that of the new form sample.
This matching procedure is seen as a means for controlling for
the possible effects of ability level differences on equating
results. Lawrence and Dorans were then able to compare the
random sample and matched sample equating results to determine
which linear and curvilinear equating procedures provided the
most and least invariant results.



Insert Figure 1 about here 



Lawrence and Dorans (1988) studied and compared random and
matched sample linear (Tucker and Levine), Equipercentile
through an anchor test, and 3-PL IRT equatings (of NEW to OLD2)
for nine forms of SAT-Mathematical and six forms of SAT-Verbal.
The equating results, particularly scaled score means produced
by the equating methods, revealed that the IRT true-score
equating method was less robust to differences in group ability
than expected, i.e., equating results for this method differed
between the matched and unmatched (random) conditions. The
Levine and Equipercentile through an anchor test equating
results also differed considerably in certain equatings studied
across the matched and random conditions. Interestingly, the
Tucker observed-score equatings appeared more invariant across
the matched and unmatched samples than any of the other methods.
This was particularly true for the SAT-Mathematical equatings
studied, where there was little to no variation in scaled score
means produced by the Tucker equating across the matched and
random conditions. For SAT-Verbal, some variation in scaled
score means resulting from the Tucker equatings was observed,
but the differences between the matched and random conditions
were always smaller for Tucker than for the other procedures.
Further, while the four equating procedures frequently produced
differing scaled score means under the random sample conditions,
use of the anchor test as a direct selection variable for
matching purposes produced a convergence of scaled score means
across the four equating procedures. Lawrence and Dorans
offered possible explanations for differences in equating
results for all the procedures studied. Certain of these
explanations, particularly the explanation for the IRT results,
will be discussed later in this paper.

Consistency of equating results, and particularly scaled
score means, across random and matched sample conditions was
used as the criterion in the Lawrence and Dorans study. One
potential problem with using consistency as the criterion is
that consistent equating results may be disparate from the
"true" equating results, were they known. In other words, the
consistent Tucker equating results might have been more
disparate from the "true" equating results in the Lawrence and
Dorans study than the inconsistent Levine or IRT equatings.
Because "true" equating results are available only when the
data are generated from known parameters, a simulation study is
needed.

One recent simulation study supplied some useful results
bearing on the lack of invariance of the 3-PL IRT
equatings. Stocking and Eignor (1986) showed that differences
of around one standard deviation between IRT equating sample
mean abilities can have a substantial effect (a five scaled
score point difference) on the SAT mean scaled score when
compared to results for samples not differing in mean ability
and to "true" results. However, most of the random and matched
sample equatings studied by Lawrence and Dorans (1988) showed
differences in score means as great as or greater than those in
the Stocking and Eignor (1986) study, although the differences
in sample mean abilities were smaller. Hence the lack of
invariance of the 3-PL IRT equating results in the Lawrence and
Dorans (1988) study suggests the design of a simulation study
where more variables can be studied than simply ability level
differences.

The goal of the present study was to develop a general
simulation model and then perform a sequence of simulations and
subsequent equatings based on the model that would address
specific issues in the application of both conventional and
IRT-based equating methodologies, many of which were brought out
in the Lawrence and Dorans (1988) study. More specifically, the
purpose of the study was to investigate, using a sequence of
simulations, the impact on four equating procedures of: 1)
differences in abilities of samples used for equating, both when
each examinee has complete data (an unrealistic setting) and
also in the presence of missing data (a more realistic setting);
2) subsequent matching of samples on an infallible measure of
ability (an unrealistic setting); and 3) subsequent matching of
samples on a fallible measure of ability (a more realistic
setting).

THE STUDY DESIGN 

The Definition of True Item and Person Parameters 
For the sequence of simulations described in this paper,
true item and person parameters are required. They could, of
course, be invented. It is more realistic, however, to use
existing parameter estimates, but treat them as if they were
true. It seems reasonable to assume that such a definition of
truth captures at least some of the predominant features of
actual data, such as the spread of abilities and item
difficulties. For this purpose, the results of a LOGIST
calibration (Wingersky, Barton, & Lord, 1982) of a single
85-item SAT-Verbal test form (administered in two separately
timed sections) plus a 45-item associated anchor test or equating
section were used as the true item parameters. Descriptive
statistics for these true item parameters are shown in Table 1.



Insert Table 1 about here 



True person parameters were defined to be the ability
estimates obtained when a sample of N = 3018 real examinees took
the Verbal form and its associated equating section. Two
population distributions of true ability were defined for this
study. The first was defined to be exactly like the
distribution of true person parameters, with mean true ability
of -.02 and standard deviation of true ability equal to 1.07. A
second population was defined to be less able, with mean true
ability of -.35, and the same standard deviation.

For the purposes of this study, a total of six independent
samples of size N = 3000 were drawn, as follows:
            Drawn from     Sample Mean     Sample Standard
Sample      Population     Ability         Deviation of Ability

  1             1            -.01              1.06
  2             1            -.03              1.08
  3             1            -.02              1.06
  4             1             .01              1.08
  5             2            -.37              1.06
  6*            2            -.06              1.08

*Sample 6 was matched to Sample 2 using the observed
formula-score distribution of Sample 2 on the anchor test.

The Generation of Complete Response Data

Two types of response data were generated for each
simulated examinee (simulee). In this section, we discuss the
generation of complete response strings; in a subsequent
section, we describe the incorporation of missing data.

To generate a response to an item for a simulee, the
simulee's true ability and the item's true 3-PL parameters are
used to generate the model-predicted probability of a correct
response (see Lord, 1980). A random number is then selected
from a uniform [0,1] distribution and compared to this model
probability. If the random number is less than the modeled
probability, the simulee is assigned a correct response to the
item; if the random number is greater than the modeled
probability, the simulee is assigned an incorrect response.
This response string may be referred to as the true
model-generated response string. It represents what the model
says about examinee behavior for every item.
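
The generation of a complete response string can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study: the scaling constant D = 1.7, the function names, the seed, and the three sets of item parameters are assumptions introduced here.

```python
import math
import random

D = 1.7  # logistic scaling constant; assumed here, not stated in the text


def p_correct(theta, a, b, c):
    """3-PL probability of a correct response for ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def generate_responses(theta, items, rng):
    """Return one response per item: 1 = correct, 0 = incorrect."""
    responses = []
    for a, b, c in items:
        u = rng.random()  # uniform random number on [0, 1)
        responses.append(1 if u < p_correct(theta, a, b, c) else 0)
    return responses


# Hypothetical (a, b, c) values, for illustration only.
items = [(1.2, 0.0, 0.23), (0.3, -3.7, 0.12), (0.9, 2.4, 0.18)]
rng = random.Random(12345)  # fixed seed so the run is repeatable
string = generate_responses(-0.02, items, rng)
```

When ability equals item difficulty, the modeled probability is c + (1 - c)/2, halfway between the lower asymptote and 1, which gives a quick check on the function.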

Models for Missing Response Data 
Real examinees rarely have complete data. Data can be
absent from a response string for at least two reasons. The
examinee may not have had time to examine all test items, and
therefore fails to respond to a block of items at the end of a
test. This type of missing data is referred to as
'not-reached'. A second type of missing data occurs, particularly
in formula-scored tests, where an examinee may decide to omit an
item because the examinee thinks that she/he can only respond at
random. For whatever reason responses are missing, it seems
most likely that the existence and patterns of missing data in
response strings may be a function of the ability the test is
designed to measure. This clearly violates the assumptions of
the 3-PL model, and will almost certainly have some effect on
calibration and equating results. It seems reasonable to
attempt to incorporate this type of examinee behavior as one of
the aspects to be studied in these simulations.

The mathematical modeling of missing responses is a complex
and difficult process involving assumptions about the behavior
of examinees that may be difficult to test. This is clearly
beyond the scope of the present paper. It is possible, however,
to develop empirically-based models of missing data that make up
for what they lack in generality by their close resemblance to
real SAT data. It is important to note that, because the models
proposed below are empirically based, they favor no particular
treatment of missing data as incorporated into specific
calibration procedures.

An Empirically-Based Model of Speededness

We wish to model speededness as a function of ability. To
do this we need the actual item responses from each real
examinee included in the calibration that produced our true item
and person parameters. We can call these data the true response
strings. We also need the true ability for each real examinee.
Using the true response strings and true ability, we build a
model of speededness only once, in advance of the simulations,
for each separately timed test section. For each quintile of
the distribution of true ability, we determine the cumulative
distribution of the number of items reached for all examinees in
the quintile. These conditional distributions will differ by
ability level, and collectively they constitute our
empirically-based model. To incorporate this model in subsequent
simulations, we proceed as follows:

1) Find the quintile into which a simulee's true ability
falls.

2) Generate a random number between zero and one.

3) In the conditional distribution for that quintile, find the
cumulative proportion that most closely matches the random
number.

4) Find the corresponding number of items reached.

5) Assume that subsequent items are not reached for the
simulee, and code 3's (the LOGIST code for not-reached items) in
the remainder of the model-generated response string for this
simulee.
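
The five steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the toy "real examinee" data are hypothetical, and step 3 is implemented as a standard inverse lookup (the smallest number of items whose cumulative proportion is at least the random number) rather than a literal nearest-value search.

```python
import bisect
import random

NOT_REACHED = 3  # LOGIST code for not-reached items, as given in the text


def quintile_cuts(abilities):
    """Four cut points dividing the ability values into five groups."""
    s = sorted(abilities)
    n = len(s)
    return [s[(k * n) // 5] for k in range(1, 5)]


def quintile_of(theta, cuts):
    """Quintile index (0 = lowest, 4 = highest) for an ability value."""
    return bisect.bisect_right(cuts, theta)


def build_speed_model(abilities, n_reached, n_items):
    """Per quintile, cumulative proportion of examinees reaching <= k items."""
    cuts = quintile_cuts(abilities)
    counts = [[0] * (n_items + 1) for _ in range(5)]
    for theta, k in zip(abilities, n_reached):
        counts[quintile_of(theta, cuts)][k] += 1
    cdfs = []
    for row in counts:
        total, running, cdf = sum(row), 0, []
        for c in row:
            running += c
            cdf.append(running / total)
        cdfs.append(cdf)
    return cuts, cdfs


def draw_items_reached(theta, cuts, cdfs, rng):
    """Steps 1-4: quintile lookup, random draw, inverse lookup in the CDF."""
    cdf = cdfs[quintile_of(theta, cuts)]
    u = rng.random()
    return bisect.bisect_left(cdf, u)  # smallest k with cdf[k] >= u


def apply_speededness(responses, reached):
    """Step 5: overwrite everything after the last item reached."""
    return responses[:reached] + [NOT_REACHED] * (len(responses) - reached)


# Toy "real examinee" data (hypothetical): abler examinees reach more items.
abilities = [(i - 25) / 10 for i in range(50)]
n_reached = [6 + i // 10 for i in range(50)]
cuts, cdfs = build_speed_model(abilities, n_reached, n_items=10)
```

The model is built once from the real data and then reused for every simulee, which mirrors the "only once, in advance of the simulations" description in the text.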

Figures 2, 3, and 4 show these empirically-based models 
separately for each separately timed section. In each of these 
figures, the frequency distribution of true abilities is plotted 
upside down; values of these proportional frequencies must be 
read from the right-hand vertical scale. This frequency 
distribution is divided into quintiles by the dotted vertical 
lines. In each figure, a solid vertical line is plotted at the 
midpoint of each quintile to serve as the x-axis for the 
cumulative conditional distributions, which are plotted 
sideways. The conditional distributions for each quintile are 
the cumulative proportions of the individuals falling in that
quintile who reached a specified proportion of the items in that 
section. Values for the specified proportion must be read from 
the left-hand vertical scale. 

Insert Figures 2, 3, 4 about here 

Although crude, these figures do demonstrate that this 
empirically-based model incorporates the number of items reached 
as a function of ability. For each separately timed section, 
there is a noticeable increase in the proportions of individuals 
completing more of the test as one looks across the quintiles 
from the lowest to the highest quintile. 

An Empirically-Based Model of Omits 

We assume that omitting behavior is a function of the
ability to be measured by the test. We also assume an
additional complexity -- that omitting also depends upon whether
an examinee thinks she/he will get an item correct or incorrect.
We need the same data as before, that is, the true response
strings and true abilities for real examinees included in the
calibration that produced our true item and person parameters.
We also need additional data, that is, the true model-generated
response strings for the same examinees. This latter response
string represents what the model predicts for each item for an
examinee.

For each item, we construct two sub-models. The first is
for those individuals whose model-predicted response was
correct; we take this to indicate that the examinee thought
she/he would get the item right. The second is for those
individuals whose model-predicted response was incorrect; we
take this to indicate that the examinee thought she/he would
get the item wrong. For each sub-model, for each quintile of
the distribution of true ability, we compute the proportion of
examinees who omit the item in the true response strings. We
construct these models for each item only once, using our true
item and person parameters, true response strings, and true
model-generated response strings. To incorporate these models
in subsequent simulations, we proceed as follows:

1) For a true simulee ability, determine the model-generated
response.

2) For the corresponding sub-model, find the quintile of the
appropriate ability distribution into which the simulee's true
ability falls.

3) Generate a random number between zero and one.

4) If the random number is less than the proportion of
omits observed in the true response strings for that quintile,
change the response in the simulee's model-generated response
string to an omit. If the random number is greater than the
proportion, do not change the response.
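
For a single item, the two sub-models and the four steps above might look like the sketch below. It is illustrative only: the names, the toy data, and the numeric code used for an omit are assumptions made here (the text gives the LOGIST code for not-reached items but not for omits), and the sketch assumes each sub-model contains at least five examinees.

```python
import bisect
import random

OMIT = 2  # hypothetical code for an omitted response


def build_omit_model(abilities, predicted, omitted):
    """Build the two sub-models for one item.

    abilities: true abilities of the real examinees
    predicted: model-generated response (1 or 0) for each examinee
    omitted:   True where the true response string shows an omit
    Returns {predicted_response: (quintile_cuts, omit_rate_per_quintile)}.
    """
    model = {}
    for outcome in (0, 1):
        group = [(t, o) for t, p, o in zip(abilities, predicted, omitted)
                 if p == outcome]
        thetas = sorted(t for t, _ in group)
        n = len(thetas)  # assumed to be at least 5
        cuts = [thetas[(k * n) // 5] for k in range(1, 5)]
        omits, totals = [0] * 5, [0] * 5
        for t, o in group:
            q = bisect.bisect_right(cuts, t)
            totals[q] += 1
            omits[q] += 1 if o else 0
        rates = [om / tot if tot else 0.0 for om, tot in zip(omits, totals)]
        model[outcome] = (cuts, rates)
    return model


def maybe_omit(theta, response, model, rng):
    """Steps 2-4: change a model-generated response to an omit at random."""
    cuts, rates = model[response]
    q = bisect.bisect_right(cuts, theta)
    return OMIT if rng.random() < rates[q] else response


# Toy data (hypothetical): only low-ability examinees predicted wrong omit.
abilities = [float(i) for i in range(20)]
predicted = [1 if i % 2 == 0 else 0 for i in range(20)]
omitted = [p == 0 and t < 10 for t, p in zip(abilities, predicted)]
model = build_omit_model(abilities, predicted, omitted)
```

Note that the quintiles are computed separately within each sub-model, following the description of "their respective ability distributions" in the discussion of Figure 5.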

The empirically-based models of omitting behavior are shown
for a few selected items in Figure 5. There are two plots for
each item -- one for those examinees whose model-generated
responses indicated that they would respond incorrectly, and a
second for those examinees whose model-generated responses
indicated that they would respond correctly. For each of these
plots, the frequency distribution of true abilities for those
examinees with the appropriate model-generated response is
plotted upside down on the horizontal axis, with vertical bars
marking off the quintiles. Actual values for this frequency
distribution must be read from the bottom vertical axis. Above
the horizontal axis in each figure, the proportion of
individuals in a quintile whose true response strings indicated
an omit is plotted with a cross at the midpoint of the quintile.
These proportions are to be read from the top vertical axis.

Insert Figure 5 about here



There is some variation in omitting rates among the items 
displayed in Figure 5. Items 15, 55, 95 and 99 are hard items 
with more than 1000 omits (33%) in the full sample. Items 1 and 
16 are easy items with fewer than 10 omits (.33%) in the full 
sample. Items 50 and 91 are items of middle difficulty; the 
rates of omitting in the true response strings are moderate. 

The following table gives the true parameters for these items. 



Item                                Number of
Number      a       b       c       Omits in full sample

  15        .9      2.4     .18     >1000
  55        .4      2.6     .13     >1000
  95       1.0      1.4     .25     >1000
  99       1.0      2.0     .26     >1000
   1        .3     -3.7     .12     <10
  16        .6     -2.8     .12     <10
  50       1.2       .0     .23     538
  91        .8       .0     .10     270

Looking at these plots leads to a number of general
conclusions. First, examinees who are modeled to get an item
right tend to omit less frequently than those modeled to get an
item wrong. This trend is most marked for those in the highest
quintile of their respective ability distributions. Second, the
rate of omitting is usually higher for lower ability, regardless
of the modeled response. This latter trend seems most
consistent for those modeled to answer an item correctly. This
model, then, reflects the two aspects we had hoped to
incorporate, namely that omitting behavior is a function of
ability and also a function of whether an examinee thinks that
she/he will respond correctly.

The Design of the Calibrations and Equatings
The simulated responses from the six samples of simulees to
the test form and equating section were combined into five
concurrent LOGIST runs, each representing an experimental
condition. The design of each LOGIST run was the same, and
was patterned after the usual SAT data collection design
presented in Figure 1.

                  Total Test or Anchor Test

              NEW     EQ1     EQ2     OLD1    OLD2
Sample 1       x       x
Sample 2       x               x
Sample 3               x               x
Sample Y                       x               x
(Y = 4, 5, or 6)



Sample 1 was administered the new form and one anchor test
(EQ1), Sample 2 was administered the new form and another anchor
test (EQ2), Sample 3 was administered the first anchor test
(EQ1) and an old form (OLD1), and a final sample (either Sample
4, 5, or 6) was administered the second anchor test (EQ2) and
another old form (OLD2). All test forms (NEW, OLD1, OLD2) had
identical true item parameters, and all anchor tests (EQ1 and
EQ2) had identical true item parameters.

From the item parameter estimates derived from each of the
LOGIST runs or from the observed-score data for the samples used
in the runs, the new form was equated to each old form using the
Tucker, Levine, Equipercentile through an anchor test, and 3-PL
IRT equating methods. The two equatings were also averaged to
produce a final equating. All old forms were placed on the SAT
200 to 800 scaled score metric by the nonlinear equating
originally derived for the SAT-Verbal form that serves as the
source of the true item and person parameters. Projected scaled
score means and standard deviations were computed for each
single equating and each average using a sample of over 90,000
examinees who took that SAT-Verbal form at its initial equating
administration.
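
The paper cites Angoff (1971) for the Tucker method and does not restate its formulas. As an illustration only, the sketch below uses a common textbook statement of Tucker linear equating through an anchor test with a synthetic population; the function names, the equal weights (w1 = 0.5), and the toy scores are assumptions made here and may differ in detail from the operational SAT procedure. A small helper shows how two linear conversions can be averaged.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Population covariance; cov(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def tucker_linear(x1, v1, y2, v2, w1=0.5):
    """Return (slope, intercept) taking new-form scores X to the Y scale.

    x1, v1: total and anchor scores for the group taking the new form
    y2, v2: total and anchor scores for the group taking the old form
    w1:     synthetic-population weight for the new-form group (assumed)
    """
    w2 = 1.0 - w1
    g1 = cov(x1, v1) / cov(v1, v1)  # regression slope of X on V, group 1
    g2 = cov(y2, v2) / cov(v2, v2)  # regression slope of Y on V, group 2
    d_mu = mean(v1) - mean(v2)
    d_var = cov(v1, v1) - cov(v2, v2)
    # Synthetic-population means and variances for X and Y.
    mu_x = mean(x1) - w2 * g1 * d_mu
    mu_y = mean(y2) + w1 * g2 * d_mu
    var_x = cov(x1, x1) - w2 * g1 ** 2 * d_var + w1 * w2 * g1 ** 2 * d_mu ** 2
    var_y = cov(y2, y2) + w1 * g2 ** 2 * d_var + w1 * w2 * g2 ** 2 * d_mu ** 2
    slope = math.sqrt(var_y / var_x)
    return slope, mu_y - slope * mu_x


def average_conversions(conv_a, conv_b):
    """Average two linear conversions given as (slope, intercept) pairs."""
    return ((conv_a[0] + conv_b[0]) / 2.0, (conv_a[1] + conv_b[1]) / 2.0)


# Toy scores (hypothetical): the old form is uniformly 5 points easier.
x = [10, 12, 15, 20, 22, 30]
v = [3, 4, 4, 6, 7, 9]
slope, intercept = tucker_linear(x, v, [xi + 5 for xi in x], v)
```

With identical anchor distributions in the two groups, the adjustment terms vanish and the conversion reduces to matching the two forms' means and standard deviations, which is the behavior the toy data are meant to show.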

The Scaling of Calibration Results 
Many of the comparisons made in this study involve the
estimated parameters obtained from separate LOGIST calibrations.
However, each calibration will have results reported on a
different metric, since LOGIST determines the reporting metric
by standardizing the ability estimates within a calibration.
Therefore, the estimated parameters must all be placed on some
common metric before such comparisons can be made.

The metric of the true item and person parameters was
chosen as the common metric within which to compare parameter
estimates. The parameter estimates from each LOGIST calibration
were transformed to this common metric by the characteristic
curve transformation method of Stocking and Lord (1983). The
transformations were based on the parameter estimates from each
calibration and the true parameters for the 130 items (85 test
items plus 45 anchor test items) taken by Sample 1.
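
For reference, the characteristic curve approach can be written out as below. This is a sketch in notation introduced here, not taken from the original: A and B are the slope and intercept of the linear change of metric, hats mark LOGIST estimates, P is the 3-PL response function, and the ability points over which the criterion is evaluated are left unspecified because the text does not state them.

```latex
% Linear change of metric applied to the LOGIST estimates:
\hat{b}_i^{*} = A\,\hat{b}_i + B, \qquad
\hat{a}_i^{*} = \hat{a}_i / A, \qquad
\hat{c}_i^{*} = \hat{c}_i, \qquad
\hat{\theta}^{*} = A\,\hat{\theta} + B .

% A and B are chosen to minimize the squared distance between the test
% characteristic curves implied by the true and the transformed estimated
% parameters, over ability values \theta_1, \dots, \theta_N:
F(A, B) = \frac{1}{N} \sum_{j=1}^{N}
  \Bigl[\, \sum_{i=1}^{130} P(\theta_j;\, a_i, b_i, c_i)
         - \sum_{i=1}^{130} P(\theta_j;\, \hat{a}_i / A,\;
                              A\hat{b}_i + B,\; \hat{c}_i) \Bigr]^{2} .
```

Because the guessing parameter is unaffected by a linear change of the ability metric, only the discriminations and difficulties move under the transformation.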

The Experimental Conditions 

The series of simulations was designed to study five
experimental conditions, shown in the following table, which
contains a letter for each experimental condition:

                          True Ability Distribution

                  Equivalent    Unequal    Equivalent by Matching
Complete data         A            B
Missing data          C            D                 E

The data for all samples in a LOGIST run were either complete
(conditions A and B) or contained missing data (conditions C, D,
and E). The final samples taking EQ2 and OLD2, Samples 4-6,
were drawn in the following fashion. Sample 4 was drawn
randomly from the same population as the other samples
(conditions A and C); Sample 5 was drawn randomly from the lower
ability population (conditions B and D); and Sample 6 was drawn
from the lower ability population to match the distribution of
observed formula scores obtained by Sample 2 on EQ2 (condition
E).
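
The text does not describe the mechanics of the matched selection. One plausible way to draw a sample whose anchor-score frequency distribution equals that of a target sample is sketched below; the function name and the toy scores are hypothetical, and the sketch assumes scores have been rounded to a common set of values and that the pool is large enough at every score level.

```python
import random
from collections import Counter, defaultdict


def draw_matched_sample(target_scores, pool_scores, rng):
    """Pick pool indices whose anchor scores reproduce the target counts.

    target_scores: anchor scores of the sample to be matched (e.g. Sample 2)
    pool_scores:   anchor scores of candidates from the other population
    Selection is random within each score level, without replacement.
    """
    by_score = defaultdict(list)
    for idx, score in enumerate(pool_scores):
        by_score[score].append(idx)
    chosen = []
    for score, need in sorted(Counter(target_scores).items()):
        candidates = by_score.get(score, [])
        if len(candidates) < need:
            raise ValueError("pool has too few candidates at score %r" % score)
        chosen.extend(rng.sample(candidates, need))
    return chosen


# Toy anchor scores (hypothetical).
target = [1, 1, 2, 3, 3, 3]
pool = [1, 1, 1, 2, 2, 3, 3, 3, 3, 0, 0]
picked = draw_matched_sample(target, pool, random.Random(0))
```

Selection of this kind uses only observed scores, which is why the matched sample is equivalent to the target on a fallible criterion rather than on true ability.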

Condition A, Complete Data and Equivalent Samples, is a
benchmark condition in that, while unlikely to be realized in
practice, it represents the best circumstances for any equating
method. Condition B, Complete Data and Unequal Samples,
provides for the exploration of the effects of different sample
abilities while still maintaining the ideal situation of
complete data for all simulees. This condition replicates the
conditions of the Stocking and Eignor (1986) study, described in
the introduction. Condition C, Missing Data and Equivalent
Samples, is a more realistic condition in that samples now
incorporate missing data. In this condition, however, samples
have been chosen to be equivalent on the basis of an infallible
criterion. Condition D, Missing Data and Unequal Samples,
represents what is typically obtained in an SAT equating of NEW
to OLD2 in the absence of any further data manipulation.
Condition E, Missing Data and Matched Samples, represents the
matching procedure employed in the Lawrence and Dorans (1988)
study; that is, matching samples on the basis of a fallible
criterion in an attempt to achieve the ideal condition of
equivalent samples.

RESULTS AND DISCUSSION 
Calibration Results 

Tables 2 through 6 contain descriptive statistics for the 
parameter estimates from each LOGIST calibration representing an 
experimental condition. In each table, the statistics for the 
item parameter estimates are given separately by test form or 
section. Statistics are also given for both the estimated and 
true abilities from each sample of simulees used in the 
calibration. These tables will be helpful in understanding some 
of the phenomena exhibited in the equating results. 

Insert Tables 2, 3, 4, 5, and 6 about here 



Equating Results 

The focus of this study has been on the effect of the 
various experimental conditions on a number of different linear 
and curvilinear observed and true-score equating procedures.
For convenience, we divide the discussion of these equating
results into two parts. In the first part, we examine the 
information available from this study relevant to a particular 
phenomenon observed by Lawrence and Dorans (1988) in their IRT 
equating results. This discussion is focused only on 
experimental condition D, Missing Data and Unequal Samples, and 
experimental condition E, Missing Data and Matched Samples. In 
the second part, we consider the results for all equating 
procedures across all experimental conditions. 

An Exploration of the Lewis Hypothesis 

Lawrence and Dorans (1988) observed that when the "matched"
sample is more able than the "random" sample, i.e., Sample 6 is
more able than Sample 5, the mean estimated item difficulty for
OLD2 is higher when the estimates are obtained from Sample 6
than when obtained from Sample 5. When this is true, it
automatically follows that the mean scaled score for NEW based
on the matched sample calibration is lower than that based on
the random-and-unequal sample calibration.

Charles Lewis (personal communication to Dorans, 1987)
hypothesized the following circumstances to explain the
difference in mean estimated item difficulties between the
random-and-unequal and matched conditions (experimental
conditions D and E in the context of the current study):

1) Selecting Sample 6 from the same population as Sample 5
(a lower ability population) to match Sample 2 on the basis of
observed scores on EQ2 will produce a sample of higher true
ability than Sample 5, but not as high as the mean true ability
for Sample 2. Given this level of true ability, the Sample 6
simulees will also have somewhat higher than expected observed
scores on EQ2 (relative to Sample 5), corresponding to positive
mean error scores in classical test theory.

2) The items in EQ2 will appear easier for Sample 6 than
for Sample 2 because of the positive errors. LOGIST will try to
reconcile these two sources of information about EQ2 items by
estimating Sample 6 simulees to be more able than they actually
are until the regressions of item score on estimated ability
coincide for the two samples.

3) Items in OLD2 are also responded to by simulees in
Sample 6, and by no other sample. If the simulees in Sample 6
are thought to be more able than they actually are, then their
estimated abilities will be shifted to the right on the ability
metric. The values of the estimated difficulties for items in
OLD2 will be relative to the estimated abilities for Sample 6,
and since these abilities are shifted to the right, the
estimated difficulties will be also, making these items appear
harder than they actually are, relative to items in other forms.

4) In experimental condition D (Missing Data, Unequal
Samples) of the current study, none of the distortions described
above should occur. Thus the estimated difficulties for EQ2 for
the two LOGIST calibrations should be approximately the same,
while the estimated difficulties for the items in OLD2 arising
from the matched sample condition (E) should be systematically
greater than the corresponding difficulties for the
random-and-unequal sample condition (D).
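
Step 1 of the hypothesis can be made concrete with a standard result from classical test theory. The sketch below is an illustration under stated assumptions, not part of the original argument: it works on the EQ2 score scale, assumes that the regression of true score on observed score in the lower ability population is linear (Kelley's formula), and introduces the symbols \mu_L (mean EQ2 score in the lower ability population), \bar{X}_2 (mean observed EQ2 score of Sample 2, with \bar{X}_2 > \mu_L), and \rho (reliability of EQ2 in the lower ability population, 0 < \rho < 1).

```latex
% Observed score = true score + error:
X = T + E .

% Kelley's regression of true score on observed score in the
% lower ability population:
\mathrm{E}[\,T \mid X = x\,] = \mu_L + \rho\,(x - \mu_L) .

% Matching forces the mean observed score of Sample 6 to equal
% that of Sample 2, so the expected mean true score of Sample 6 is
\mathrm{E}[\,\bar{T}_6\,] = \mu_L + \rho\,(\bar{X}_2 - \mu_L),
\qquad
\mu_L < \mathrm{E}[\,\bar{T}_6\,] < \bar{X}_2 .

% The remainder of the observed mean is error:
\mathrm{E}[\,\bar{E}_6\,] = \bar{X}_2 - \mathrm{E}[\,\bar{T}_6\,]
  = (1 - \rho)\,(\bar{X}_2 - \mu_L) > 0 .
```

Under these assumptions the matched sample is more able than a random sample from the lower ability population but less able than Sample 2, and its mean error score is positive, as step 1 states. The less reliable the anchor test, the larger the positive mean error.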

Lawrence and Dorans (1988) presented a table of average
values for estimated item parameters for one of the
SAT-Mathematical forms they studied under both experimental
conditions. As in the case described above, the old form sample
obtained by the matching process was more able than the randomly
selected old form sample. The average difficulty for the old
form affected by the change in sampling is about .08 higher
under the matched sampling condition than under the random
sampling condition, which supports the Lewis hypothesis.

Table 7 presents the same type of information as presented
by Lawrence and Dorans, but for the current simulation.
However, in addition to the average values of item parameter
estimates for the random-and-unequal case (D) and matched case
(E), the same information is also presented for the Missing
Data, Equivalent Samples condition (C), a condition that is
equivalent to matching on an infallible criterion. As noted
earlier, it is the results of this latter condition that the
matching process is employed to achieve.



Insert Table 7 about here 



Looking at the columns for item difficulty, we see that the 
average difficulty for the Missing Data, Matched Samples 
condition is .07 higher than the average difficulty for the 
Missing Data, Unequal Samples condition. In addition, there is 
little, if any, difference between the average difficulties for 
the other sections involved in the concurrent calibration. 
Differences between the averages of other item parameters are 
also small. These results replicate the Lawrence and Dorans 
(1988) results and support the Lewis hypothesis. 

Perhaps even more notable, however, is the comparison of 
these two conditions with the "ideal" condition: Missing Data, 
Equivalent Samples. For the test forms and equating sections 
not affected by the sample selection, there are few, if any, 
differences among the averages of the item parameter estimates 
across all three conditions. There is a change of only .01 in 
average item difficulty for OLD2, compared to the ideal 
condition, when there are true differences in ability. Matching 
samples on fallible criteria produces a much larger difference 
(.08) in average estimated difficulty. This suggests that such 
matching may introduce undesirable distortions in estimated item 
difficulties. 
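The three reported differences are mutually consistent, as a short check shows. The symbols here are introduced only for illustration: they denote the average estimated OLD2 difficulty under conditions C, D, and E. The check assumes that the .01 change under condition D is in the upward direction.

```latex
\bar{b}_E - \bar{b}_C
  = (\bar{b}_E - \bar{b}_D) + (\bar{b}_D - \bar{b}_C)
  \approx .07 + .01 = .08
```

That is, the .07 gap between the matched and unequal conditions accounts for nearly all of the .08 departure of the matched condition from the ideal condition.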

A more detailed comparison of results from the unequal 
samples and matched samples conditions is shown in Figure 6. 
Each page of this multipage figure shows a scatterplot (top) and 
residuals (bottom) for the item parameter estimates for a 
particular test section. In all scatterplots, the matched 
condition results are on the vertical axis and the 
random-and-unequal sample condition results are on the horizontal axis. 
All residual plots are formed by subtracting the unequal sample 
condition results from the matched sample condition results. 
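The residual computation described above can be sketched in a few lines. The item difficulty values below are hypothetical and serve only to show the direction of subtraction (matched minus unequal); they are not estimates from the study.

```python
from statistics import fmean


def residuals(matched, unequal):
    """Return matched-minus-unequal residuals, item by item."""
    if len(matched) != len(unequal):
        raise ValueError("both runs must cover the same items")
    return [m - u for m, u in zip(matched, unequal)]


# Hypothetical difficulty estimates for five items (illustration only).
b_unequal = [-1.20, -0.40, 0.10, 0.75, 1.60]
b_matched = [-1.12, -0.35, 0.18, 0.80, 1.69]

res = residuals(b_matched, b_unequal)
mean_shift = fmean(res)                               # average vertical offset
share_positive = sum(r > 0 for r in res) / len(res)   # fraction above zero

print([round(r, 2) for r in res], round(mean_shift, 2), share_positive)
```

A systematic upward shift shows up as residuals that are mostly positive, rather than scattered evenly around zero.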



Insert Figure 6 about here 



For the New form (NEW), EQ1, and OLD1, there are only a few 
items whose results do not lie exactly on the 45-degree line. 
These items are different because in one run their c's were 
fixed at COMC (see Wingersky, Barton, & Lord, 1982) and in the 
other run they were not. For EQ2, there is more scatter of the 
estimates around the 45-degree line, and the plot of item 
discriminations shows that the discriminations are slightly 
higher in the random condition, confirming the differences 
between the means in Table 7. For OLD2, there is even more 
scatter for all three item parameter estimates than seen for 
EQ2. The plot of the item difficulties shows that the estimates 
under the matched condition are slightly, but 
systematically, higher for almost all items. 

To examine the same type of information for real data, as 
opposed to the simulated data developed for this study, a 
particular SAT-Verbal form studied by Lawrence and Dorans (1988) 
was selected. The form was chosen because the reported 
differences showed that the average ability for the lower 
ability sample taking one old form was about 1/3 of a standard 
deviation below that for the new form. This resembles the 
simulated conditions of the current study. 

Figure 7 shows the information for this form analogous to 
that shown for the simulated data in Figure 6. The calibration 
design for the chosen form was exactly the same as in the 
simulation, but in contrast to the simulation, each test form 
and equating section for the real-data calibration differed from 
each other. As in the simulated results, item parameter 
estimates for NEW, EQ1, and OLD1 were not affected by the sample 
selection. Estimates for EQ2 and OLD2 were affected, and in 
much the same way as the simulated results. The item difficulty 
estimates for OLD2 are slightly, but systematically, higher in 
the matched condition. These results, like the simulation 
results, provide further evidence in support of the Lewis 
hypothesis. 



Insert Figure 7 about here 



Table 7 and Figures 6 and 7 suggest that if IRT equating is 
to be used, then the matching of samples based on a fallible 
criterion is not recommended. This selection produces results 
that differ more from the ideal condition of selection on an 
infallible criterion than do the results based on the use of 
samples that are unequal in true ability. At the same time, 
this selection introduces an undesirable bias in the estimates 
of item difficulty for the old form. 

Equating Results for All Methods and All Conditions 

Table 8 shows the projected scaled score means and standard 
deviations for all individual equatings performed and for the 
averages. Figure 8 plots the projected scaled score means for 
the individual equatings (not the averages). The left side of 
this figure gives the results of the equatings of the New Form 
to Old Form 1, and the right side gives the results for the 
equatings of the New Form to Old Form 2. The experimental 
conditions are positioned along the horizontal axis. The 
projected scaled score means are read from the vertical axis. 
For each experimental condition, the projected scaled score 
means are labeled with a T for Tucker, L for Levine, E for 
Equipercentile, and I for IRT equating. The points for a 
particular equating method are connected with straight lines to 
make the plots easier to read. 



Insert Table 8 and Figure 8 about here 



Both Table 8 and Figure 8 show that the differences among 
projected mean scaled scores are generally small, with a few 
exceptions to be discussed later. This is not surprising since 
all test forms have the same true item parameters in this 
simulation; only samples have been changed. Thus equating 
NEW to OLD1 or to OLD2 is equivalent to equating a test to 
itself, using identical anchor test sections. The importance of 
these small differences cannot be judged, since 
approximate standard errors have not been developed for all 
methods (i.e., the IRT standard errors have not been developed 
to date). 
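The remark about equating a test to itself can be made concrete with a generic linear equating function. The sketch below sets means and standard deviations equal directly; it is an assumption-laden simplification, not the Tucker or Levine procedure, both of which first estimate these moments for a synthetic population through the anchor test. The scores are hypothetical.

```python
from math import isclose
from statistics import fmean, pstdev


def linear_equate(x, new_scores, old_scores):
    """Map score x on the new form to the old-form scale by
    matching means and standard deviations (generic linear form)."""
    mean_new, sd_new = fmean(new_scores), pstdev(new_scores)
    mean_old, sd_old = fmean(old_scores), pstdev(old_scores)
    return mean_old + (sd_old / sd_new) * (x - mean_new)


# Hypothetical raw scores on one form (illustration only).
scores = [31, 38, 42, 47, 55]

# Same form, same score distribution: the conversion is the identity.
same = linear_equate(40, scores, scores)

# Same form, but the "old form" sample scores two points lower:
# the conversion now departs from the identity by that amount.
shifted = linear_equate(40, scores, [s - 2 for s in scores])

print(same, shifted)
```

When identical forms are equated, any departure from the identity conversion reflects the samples rather than the forms, which is why the differences in Table 8 are informative about sampling effects.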

To evaluate these results, it seems useful to compare the 
results of each equating method across experimental conditions 
to its own value in the "benchmark" condition. This condition, 
shown to the far left of each subplot, is the one in which data 
are complete for each simulee and all samples of simulees are 
drawn from the same ability distribution. 

New Form Equated to Old Form 1 

Conventional equating methods for equating NEW to OLD1 are 
not affected by different samples taking OLD2 since these 
samples do not enter into the equating. Thus, the equated means 
for the conventional methods are identical for conditions 
involving complete data (A and B), and also identical to one 
another, though at a different value, for conditions involving 
missing data (C, D, and E). 
In contrast, since all test forms are calibrated concurrently, 
IRT equating results vary slightly across conditions in which 
the samples taking the other old form vary. 

All equating methods are affected by the presence of 
missing data in both the NEW and OLD1 samples (conditions C vs. 
A and conditions D vs. B), although IRT equating is less 
affected than conventional methods. The kind of missing data 
modeled here, in which both the number of items reached and 
the number omitted are functions of ability, tends to make all 
simulees appear slightly less able and the tests appear slightly 
harder. In the IRT case, the new form is harder than the old 
form when the data are complete (see Table 2 or Table 3). When 
missing data are introduced, both test forms are harder, but 
differentially so, and the old form becomes even easier relative 
to the new form (see Table 4 or Table 5). Thus the new form scaled 
score mean is raised by introducing missing data. 

For the IRT equatings, all other effects are not 
explainable on the basis of means of estimated item parameters, 
but may be explainable by slight changes in the distributions of 
item parameter estimates due to what is most likely sampling 
variability. 

If comparison with respective benchmark conditions is a 
reasonable criterion, then IRT shows the least variation across 
conditions studied. 
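The benchmark comparison can be reproduced from the NEW to OLD1 means reported in Table 8. The sketch below computes, for each method, the largest departure of any condition from that method's own benchmark (condition A). Tucker is left out only because its condition A and B entries are not available in this copy of the table; the function name is a hypothetical helper.

```python
# Projected scaled score means, NEW to OLD1, from Table 8.
NEW_TO_OLD1_MEANS = {
    "Levine": {"A": 420.89, "B": 420.89, "C": 422.31,
               "D": 422.31, "E": 422.31},
    "Equipercentile": {"A": 420.74, "B": 420.74, "C": 422.00,
                       "D": 422.00, "E": 422.00},
    "IRT": {"A": 422.12, "B": 422.35, "C": 422.52,
            "D": 422.77, "E": 422.50},
}


def max_departure(means, benchmark="A"):
    """Largest absolute difference between any condition and the benchmark."""
    return max(abs(value - means[benchmark]) for value in means.values())


departures = {method: round(max_departure(means), 2)
              for method, means in NEW_TO_OLD1_MEANS.items()}
least_variable = min(departures, key=departures.get)

print(departures, least_variable)
```

By this measure the IRT means stay within about two thirds of a scaled score point of their benchmark, while the Levine and Equipercentile means move by more than one point once missing data are introduced.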

New Form Equated to Old Form 2 

These equatings, shown in the right-hand subplot of Figure 
8, are the interesting ones: by design they are most affected 
by the experimental conditions. As seen in Figure 8 and also in 
Table 8, the benchmark results for all equating methods 
differ from the benchmark results for the equating of 
NEW to OLD1. The IRT benchmark results are most different 
(over two scaled score points); the Equipercentile benchmark 
results are least different (less than a tenth of a scaled 
score point). 

Perhaps the most striking aspect of these equatings is the 
sensitivity of observed-score equating methods to differences in 
true sample ability. The introduction of unequal samples, 
whether in the complete data situation (conditions B vs. A) or 
in the missing data conditions (conditions D vs. C), has the 
largest impact on Tucker equating, and a smaller but still 
substantial impact on Equipercentile equating. Of the remaining two 
methods, Levine equating is more affected than the IRT equating. 

As in the OLD1 equatings, the introduction of missing data 
(conditions C vs. A and conditions D vs. B) also impacts the 
projected means, making them slightly higher for all equating 
methods. The explanation for the IRT results offered previously 
for the equating of NEW to OLD1 seems to hold here also. 

The Lewis hypothesis is again demonstrated by the slight 
decrease in the projected mean for IRT equating from the 
random-and-unequal sample condition (D) to the matched sample condition 
(E). Tucker and Levine equatings are identical, as they must 
be, under matched sampling conditions, and the Equipercentile 
equating is close to them. 
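One way to see why the Tucker and Levine results must coincide under matching is through the synthetic-population form of linear anchor-test equating found in standard equating texts. The sketch below uses that textbook form as an assumption; it is not necessarily the exact computational form used in this study. In that form the two methods differ only in the slope terms (called `gamma_new` and `gamma_old` here), and those terms multiply differences in anchor-test means and variances. All numbers are hypothetical.

```python
from math import isclose, sqrt


def synthetic_linear(x, new, old, gamma_new, gamma_old, w_new=0.5):
    """Linear equating through an anchor test, synthetic-population form.

    new/old: dicts with total-test mean/sd ("m", "s") and anchor
    mean/sd ("mv", "sv") for the new-form and old-form samples.
    The gammas are where Tucker and Levine would plug in different values.
    """
    w_old = 1.0 - w_new
    d_mean = new["mv"] - old["mv"]            # anchor mean difference
    d_var = new["sv"] ** 2 - old["sv"] ** 2   # anchor variance difference

    mean_x = new["m"] - w_old * gamma_new * d_mean
    mean_y = old["m"] + w_new * gamma_old * d_mean
    var_x = (new["s"] ** 2 - w_old * gamma_new ** 2 * d_var
             + w_new * w_old * gamma_new ** 2 * d_mean ** 2)
    var_y = (old["s"] ** 2 + w_new * gamma_old ** 2 * d_var
             + w_new * w_old * gamma_old ** 2 * d_mean ** 2)
    return mean_y + sqrt(var_y / var_x) * (x - mean_x)


# Hypothetical moments (illustration only).
new_sample = {"m": 40.0, "s": 10.0, "mv": 20.0, "sv": 5.0}
matched_old = {"m": 42.0, "s": 11.0, "mv": 20.0, "sv": 5.0}   # same anchor moments
unequal_old = {"m": 42.0, "s": 11.0, "mv": 18.0, "sv": 5.0}   # lower anchor mean

# Two arbitrary pairs of slopes stand in for the two methods.
matched_a = synthetic_linear(45, new_sample, matched_old, 1.2, 1.3)
matched_b = synthetic_linear(45, new_sample, matched_old, 2.0, 2.1)
unequal_a = synthetic_linear(45, new_sample, unequal_old, 1.2, 1.3)
unequal_b = synthetic_linear(45, new_sample, unequal_old, 2.0, 2.1)

print(matched_a, matched_b, unequal_a, unequal_b)
```

When the anchor moments agree across samples, every slope-dependent term vanishes, so any two choices of slope give the same conversion. When the anchor means differ, the choice of slope matters, which is where the two methods part ways.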

If the benchmark condition is used as a criterion, it seems 
clear that IRT equating varies least across all experimental 
conditions. If the Missing Data, Equivalent Samples condition 
(C) is a more practical criterion, in other missing data 
conditions (D and E), all equating methods except Tucker come 
closer to this criterion when random-and-unequal samples are 
used than when matched samples are used. The matching process 
appears to improve the Tucker method, while making the other 
methods worse. 

These results suggest that if Levine, Equipercentile, or 
IRT equatings are to be used, more reasonable results are 
obtained using random-and-unequal samples. If Tucker equating 
is to be used, better results are obtained with matched samples 
than with random-and-unequal samples. However, if the decision 
concerning the choice of equating procedure is to be made after 
the sampling decision, then these results suggest that it is 
better to use the random-and-unequal sampling that typically 
occurs in SAT equating situations, and never select the Tucker 
method. 
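The comparison against condition C can be checked directly from the NEW to OLD2 means in Table 8. The sketch below computes each method's distance from its own condition C mean under conditions D and E; the function name is a hypothetical helper, and only the table values come from the study.

```python
# Projected scaled score means, NEW to OLD2, from Table 8.
NEW_TO_OLD2_MEANS = {
    "Tucker": {"C": 421.71, "D": 415.35, "E": 417.95},
    "Levine": {"C": 421.15, "D": 420.42, "E": 417.95},
    "Equipercentile": {"C": 421.05, "D": 419.04, "E": 417.82},
    "IRT": {"C": 420.46, "D": 420.12, "E": 419.07},
}


def distance_from_criterion(means, criterion="C"):
    """Absolute distance of each other condition from the criterion mean."""
    return {cond: abs(value - means[criterion])
            for cond, value in means.items() if cond != criterion}


closer = {}
for method, means in NEW_TO_OLD2_MEANS.items():
    dist = distance_from_criterion(means)
    closer[method] = min(dist, key=dist.get)   # "D" (unequal) or "E" (matched)
    print(f"{method:15s} D: {dist['D']:.2f}  E: {dist['E']:.2f}"
          f"  closer: {closer[method]}")
```

Only Tucker lands closer to condition C under matched sampling (E); Levine, Equipercentile, and IRT are all closer under random-and-unequal sampling (D).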

CONCLUSIONS 

The conclusions reported in this study must be considered 
tentative since they are based on a single sequence of 
simulations, and will remain tentative until they are replicated 
by other studies. Further, the results should be examined from 
the viewpoint that response data were generated according to the 
3-PL model, with some specific model violations introduced to 
incorporate missing data. These circumstances may favor the 
3-PL IRT equating results. In addition, it is not possible to 
draw definitive conclusions about the importance of the equating 
differences until some other study produces estimates of 
standard errors for all equating methods studied. 

With the above in mind, the following tentative conclusions 
may be offered based on the results of this study: 

1. If IRT true-score equating procedures are to be 
employed, matching of samples based on a fallible 
criterion, such as an anchor test observed-score 
distribution, is not recommended. This selection 
produces results that differ more from the ideal 
condition of selection on an infallible criterion than 
do the results based on the use of samples of unequal 
ability. Such selection also introduces an undesirable 
bias in the estimates of item difficulty for the old 
form. 

2. If Levine equally reliable or Equipercentile (through an 
anchor test) observed-score equating procedures are to 
be employed, more reasonable results are also obtained 
from use of samples of unequal ability and matching is 
not recommended. Only for Tucker equating are better 
results obtained when samples are matched on a fallible 
criterion. 

Finally, it is reasonable to ask how the results of this 
study compare to the real data results observed in the Lawrence 
and Dorans (1988) study. Their study involved looking at only 
conditions D and E of the equating of NEW to OLD2 in Figure 8; 
they were, however, able to observe the results for a number of 
different forms. The results of this study for conditions D and 
E of the equating of NEW to OLD2 are not totally inconsistent 
with the Lawrence and Dorans findings for SAT-Verbal, and, in 
fact, the results reported in this study closely correspond to 
the results for one of the forms studied by Lawrence and Dorans. 
The results from this study are somewhat inconsistent with the 
Lawrence and Dorans findings for SAT-Mathematical, where little 
variation was found across the Tucker results for conditions D 
and E of the equating of NEW to OLD2. Further investigations 
are presently being planned to attempt to reconcile the 
inconsistency of equating results that appears to exist for the 
Tucker method for SAT-Verbal and SAT-Mathematical. 

Summary Statistics for the Estimated Item and Person Parameters for Condition A: Complete Data and 
Equivalent Samples. Also Summary Statistics for the True Abilities of Samples 1 through 4. This Condition is 
the Benchmark Condition. 

Means and Standard Deviations of Item Parameter Estimates for the 
Three Missing-Data Conditions: E = Equivalent Samples (Condition C), R = Random-and-Unequal 
Samples (Condition D), M = Matched Samples (Condition E) 

Table 8: Projected Scaled Score Means and Standard Deviations 
for All Equating Methods and All Experimental Conditions 

Tucker 

                          NEW to OLD1        NEW to OLD2          Average 
Condition                Mean     S.D.      Mean     S.D.      Mean     S.D. 
Benchmark 
Missdata, equal      C  422.10   111.14    421.71   109.14    421.89   110.13 
Missdata, uneq       D  422.10   111.14    415.35   107.92    418.71   109.07 
Missdata, matched    E  422.10   111.14    417.95   108.92    420.02   110.02 

Levine 

                          NEW to OLD1        NEW to OLD2          Average 
Condition                Mean     S.D.      Mean     S.D.      Mean     S.D. 
Benchmark            A  420.89   112.30    420.79   107.55    420.83   109.91 
Compdata, uneq       B  420.89   112.30    420.06   106.97    420.47   109.62 
Missdata, equal      C  422.31   110.87    421.15   108.42    421.73   109.63 
Missdata, uneq       D  422.31   110.87    420.42   108.01    421.36   109.43 
Missdata, matched    E  422.31   110.87    417.95   108.92    420.13   109.88 

Equipercentile 

                          NEW to OLD1        NEW to OLD2          Average 
Condition                Mean     S.D.      Mean     S.D.      Mean     S.D. 
Benchmark            A  420.74   112.77    420.82   107.85    420.81   110.24 
Compdata, uneq       B  420.74   112.77    418.76   107.39    419.78   110.00 
Missdata, equal      C  422.00   110.67    421.05   108.24    421.52   109.38 
Missdata, uneq       D  422.00   110.67    419.04   108.02    420.52   109.28 
Missdata, matched    E  422.00   110.67    417.82   108.93    419.90   109.72 

IRT 

                          NEW to OLD1        NEW to OLD2          Average 
Condition                Mean     S.D.      Mean     S.D.      Mean     S.D. 
Benchmark            A  422.12   111.10    419.79   109.13    420.95   110.12 
Compdata, uneq       B  422.35   110.99    419.70   109.56    420.76   110.27 
Missdata, equal      C  422.52   110.37    420.46   108.94    421.49   109.65 
Missdata, uneq       D  422.77   110.17    420.12   109.90    421.45   110.04 
Missdata, matched    E  422.50   110.33    419.07   108.68    420.79   109.50 

Figure 1. Data collection design for equating the SAT 

                 Total Test or Anchor Test 
             NEW    EQ1    EQ2    OLD1   OLD2 
Sample 1      X      X 
Sample 2      X             X 
Sample 3             X             X 
Sample 4                    X             X 

Notes: An X denotes the specific total test and anchor 
test taken by a specific sample. 

Samples 1 and 2 are random samples from the same total 
group. 

Samples 1 and 3 are samples from different total groups 
that are similar in ability. 

Samples 2 and 4 are samples from different total groups 
that are dissimilar in ability. 

Figure 2. The cumulative distributions of the percentage of items reached by 
quintile of true ability for SAT-V, Section 1. 

Figure 3. The cumulative distributions of the percentage of items reached by 
quintile of true ability for SAT-V, Section 2. 

Figure 4. The cumulative distributions of the percentage of items reached by 
quintile of true ability for SAT-V, anchor test section. 

Figure 5. For selected items, the proportion of omits in 
true response strings, separately by quintiles for right/wrong 
modeled responses. 

Figure 5, continued. Panel titles recoverable from the plots: 50 WRONG, 50 RIGHT. 

Figure 6a. For simulated data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for NEW. 

Figure 6b. For simulated data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for EQ1. 

Figure 6c. For simulated data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for OLD1. 

Figure 6d. For simulated data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for EQ2. 

Figure 6e. For simulated data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for OLD2. 

Figure 7a. For real data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for NEW. 

Figure 7b. For real data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for EQ1. 

Figure 7c. For real data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for OLD1. 

Figure 7d. For real data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for EQ2. 

Figure 7e. For real data, item parameter estimates for matched vs. random-and-unequal 
conditions and residuals, for OLD2. 